Similarity searching of chemical databases using atom environment descriptors: evaluation of performance

Authors

  • Andreas Bender
  • Hamse Y. Mussa
  • Robert C. Glen
  • Stephan Reiling
Abstract

A molecular similarity searching technique based on Atom Environments, information-gain based feature selection and the Naïve Bayesian Classifier has been applied to a series of diverse datasets, and its performance has been compared to alternative searching methods. In this application, using a recently published dataset of more than 100,000 molecules from the MDL Drug Data Report (MDDR) database, the Atom Environment approach appears to outperform fusion of ranking scores as well as binary kernel discrimination, both of which are used in combination with Unity fingerprints. Overall retrieval rates among the top 5% of the sorted library are nearly 10 percentage points higher (more than 14% better in relative terms) than those of the second best method, Unity fingerprints with binary kernel discrimination. In 10 out of 11 sets of active compounds the combination of Atom Environments and the Naïve Bayesian Classifier appears to be the superior method, while in the remaining dataset, data fusion and binary kernel discrimination in combination with Unity fingerprints are the methods of choice. Binary kernel discrimination in combination with Unity fingerprints generally comes second in overall performance. The difference in performance can largely be attributed to the different molecular descriptors used: Atom Environments outperform Unity fingerprints by a large margin when both descriptors are compared in combination with the Tanimoto coefficient. The Naïve Bayesian Classifier, combined with information-gain based feature selection and a sensible number of selected features, performs about as well as binary kernel discrimination in experiments where these classification methods are compared.

When used on a monoamine oxidase dataset, Atom Environments and the Naïve Bayesian Classifier perform as well as binary kernel discrimination in the case of a 50/50 split of training and test compounds. In the case of sparse training data, binary kernel discrimination is found to be superior on this particular dataset. On a third dataset, the Atom Environment descriptor is again shown to be superior to other 2D fingerprints when used in combination with the Tanimoto similarity coefficient. Feature selection is shown to be a crucial step in determining the performance of the algorithm. The representation of molecules by Atom Environments is found to be more effective than Unity fingerprints for the type of biological receptor similarity calculations examined here. Combining information prior to scoring, as in the Bayesian Classifier and binary kernel discrimination, is found to be superior to posterior data fusion (in the datasets tested here). Brief: A novel fast …
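The building blocks named in the abstract (the Tanimoto coefficient, information-gain based feature selection, and Naïve Bayesian scoring) can be illustrated with a minimal Python sketch. It is not the authors' implementation: molecules are assumed to be plain sets of feature identifiers (the strings below are hypothetical stand-ins for atom environment descriptors), the function names are invented for illustration, Laplace smoothing is an assumption, and only features present in a molecule contribute to its score.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient between two feature sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def entropy(pos, neg):
    """Binary entropy (in bits) of a class split given two counts."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h


def information_gain(feature, actives, inactives):
    """Reduction in class entropy from knowing whether `feature` is present."""
    n_act, n_inact = len(actives), len(inactives)
    total = n_act + n_inact
    act_with = sum(feature in m for m in actives)
    inact_with = sum(feature in m for m in inactives)
    with_f = act_with + inact_with
    without_f = total - with_f
    conditional = (
        (with_f / total) * entropy(act_with, inact_with)
        + (without_f / total) * entropy(n_act - act_with, n_inact - inact_with)
    )
    return entropy(n_act, n_inact) - conditional


def select_features(actives, inactives, k):
    """Return the k features with the highest information gain."""
    features = set().union(*actives, *inactives)
    ranked = sorted(
        features,
        key=lambda f: (-information_gain(f, actives, inactives), f),
    )
    return ranked[:k]


def naive_bayes_score(molecule, selected, actives, inactives):
    """Sum of log-likelihood ratios over selected features present in the molecule.

    Laplace smoothing (+1 / +2) is an assumption made to avoid zero counts.
    """
    score = 0.0
    for f in selected:
        if f in molecule:
            p_act = (sum(f in m for m in actives) + 1) / (len(actives) + 2)
            p_inact = (sum(f in m for m in inactives) + 1) / (len(inactives) + 2)
            score += math.log(p_act / p_inact)
    return score


# Toy training data: each molecule is a set of hypothetical feature identifiers.
actives = [{"a", "b"}, {"a", "c"}]
inactives = [{"d", "b"}, {"d", "e"}]

selected = select_features(actives, inactives, k=2)
library = [{"a", "x"}, {"d"}, {"b"}]
ranked_library = sorted(
    library,
    key=lambda m: naive_bayes_score(m, selected, actives, inactives),
    reverse=True,
)
```

Sorting a library by this score and counting the known actives near the top of the list corresponds to the kind of retrieval-rate comparison the abstract reports; the number of selected features `k` is the parameter the abstract describes as needing a sensible choice.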

Similar resources

Expanding the fragrance chemical space for virtual screening

The properties of fragrance molecules in the public databases SuperScent and Flavornet were analyzed to define a "fragrance-like" (FL) property range (Heavy Atom Count ≤ 21, only C, H, O, S, (O + S) ≤ 3, Hydrogen Bond Donor ≤ 1) and the corresponding chemical space including FL molecules from PubChem (NIH repository of molecules), ChEMBL (bioactive molecules), ZINC (drug-like molecules), and GD...

An alphabetic code based atomic level molecular similarity search in databases

Atomic level molecular similarity and diversity studies have gained considerable importance through their wide application in Bioinformatics and Chemo-informatics for drug design. The availability of large volumes of data on chemical compounds requires new methodologies for efficient and effective searching of its archives in less time with optimal computational power. We describe an alphabetic...

Structure/Response Correlations and Similarity/Diversity Analysis by GETAWAY Descriptors, 2. Application of the Novel 3D Molecular Descriptors to QSAR/QSPR Studies

In a previous paper the theory of the new molecular descriptors called GETAWAY (GEometry, Topology, and Atom-Weights AssemblY) was explained. These descriptors have been proposed with the aim of matching 3D-molecular geometry, atom relatedness, and chemical information. In this paper prediction ability in structure-property correlations of GETAWAY descriptors has been tested extensively by anal...

GPU-accelerated Chemical Similarity Assessment for Large Scale Databases

The assessment of chemical similarity between molecules is a basic operation in chemoinformatics, a computational area concerned with the manipulation of chemical structural information. Comparing molecules is the basis for a wide range of applications such as searching in chemical databases, training prediction models for virtual screening or aggregating clusters of similar compounds. However...

A New Vision-Based and GPS-Signal-Independent Approach in Jamming Detection and UAV Absolute Positioning Assessment

Positioning of Unmanned Aerial Vehicles (UAVs) in outdoor environments is usually done with the Global Positioning System (GPS). Because the GPS signal has low power at the Earth's surface, its performance is disrupted in environments contaminated by jamming attacks. UAV positioning with GPS, and its accuracy, are degraded under such attacks. A positioning error about tens of...

Publication date: 2004